If we wanted to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here
To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.
Run the code below to load the file:
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv",
skip = 1,
na = "***")Notice that, when using this function, we added two options:
skip and na.
skip=1 option is there as the real data table only
starts in Row 2, so we need to skip one row.na = "***" option informs R how missing observations in
the spreadsheet are coded. When looking at the spreadsheet, you can see
that missing data is coded as “***”. It is best to specify this here, as
otherwise some of the data is not recognized as numeric data.Once the data is loaded, notice that there is a object titled
weather in the Environment panel. If you
cannot see the panel (usually on the top-right), go to
Tools > Global Options >
Pane Layout and tick the checkbox next to
Environment. Click on the weather object, and
the dataframe will pop up on a seperate tab. Inspect the dataframe.
For each month and year, the dataframe shows the deviation of temperature from the normal (expected). Further the dataframe is in wide format.
You have two objectives in this section:
Select the year and the twelve month variables from the
weather dataset. We do not need the others (J-D, D-N, DJF,
etc.) for this assignment. Hint: use select()
function.
Convert the dataframe from wide to ‘long’ format. Hint: use
gather() or pivot_longer() function. Name the
new dataframe as tidyweather, name the variable containing
the name of the month as month, and the temperature
deviation values as delta.
# Create tidyweather data
tidyweather <- weather %>%
select(-c("J-D", "D-N", "DJF", "JJA", "MAM", "SON")) %>%
pivot_longer(cols = -Year, names_to = "month", values_to = "delta")Inspect your dataframe. It should have three variables now, one each for
Let us plot the data using a time-series scatter plot, and add a
trendline. To do that, we first need to create a new variable called
date in order to ensure that the delta values
are plot chronologically.
In the following chunk of code, I used the
eval=FALSEargument, which does not run a chunk of code; I did so that you can knit the document before tidying the data and creating a new dataframetidyweather. When you actually want to run this code and knit your document, you must deleteeval=FALSE, not just here but in all chunks wereeval=FALSEappears.
# Set date format tidyweather data
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), month, "1")),
month = month(date, label=TRUE),
year = year(date))
# Plot weather anomalies
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs (
title = "Weather Anomalies"
)Is the effect of increasing temperature more pronounced in some
months? Use facet_wrap() to produce a seperate scatter plot
for each month, again with a smoothing line. Your chart should
human-readable labels; that is, each month should be labeled “Jan”,
“Feb”, “Mar” (full or abbreviated month names are fine), not
1, 2, 3.
Answer:
The effect of increasing temperature is more pronounced in the winter months - starting from the end of fall until the start of spring. We see this from the higher values within the winter months over the other months across time. This reflects the effects of global warming/climate change throughout the last decade.
It is sometimes useful to group data into different time periods to
study historical data. For example, we often refer to decades such as
1970s, 1980s, 1990s etc. to refer to a period of time. NASA calcuialtes
a temperature anomaly, as difference form the base periof of 1951-1980.
The code below creates a new data frame called comparison
that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980,
1981-2010 and 2011-present.
We remove data before 1800 and before using filter.
Then, we use the mutate function to create a new variable
interval which contains information on which period each
observation belongs to. We can assign the different periods using
case_when().
# Assign periods
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))Inspect the comparison dataframe by clicking on it in
the Environment pane.
Now that we have the interval variable, we can create a
density plot to study the distribution of monthly deviations
(delta), grouped by the different time periods we are
interested in. Set fill to interval to group
and colour the data by different time periods.
# Create density plot of monthly deviation based off each interval period
ggplot(comparison, aes(delta, fill = interval, alpha = 0.1)) +
geom_density() +
labs(title = "Denisty plots of monthly deviations", subtitle = "Separated by Intervals", x = "Delta", y = "Density") +
theme(plot.title = element_text(size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)
)NOTE:
From 1881 until 1980, we did not observe huge anomalies in terms of temperature delta from the base period (1951-1980). However, this was not reflected past 1980; temperatures began rising more drastically from 1981-present. More strikingly, the average delta for the last period is around 0.5 higher than that of the previous period (1981-2010). This hints at even higher average deviations in future that come faster unless a downward pressure is exerted on temperature increases through climate change measures.
So far, we have been working with monthly anomalies. However, we
might be interested in average annual anomalies. We can do this by using
group_by() and summarise(), followed by a
scatter plot to display the result.
# Create yearly averages
average_annual_anomaly <- comparison %>%
group_by(Year) %>%
summarise(mean_delta_year = mean(delta))
# plot mean annual delta across years:
ggplot(average_annual_anomaly, aes(x = Year, y= mean_delta_year)) +
geom_point() +
geom_smooth(colour = "red", method = "loess") +
theme_bw() +
labs (
title = "Yearly Weather Anomalies",
y = "Mean annual delta"
)#Fit the best fit line, using LOESS method
#change theme to theme_bw() to have white background + black frame around plotNOTE:
Like what we were hinting at earlier, the yearly anomalies over the past 4 decades (starting from 1980) have followed a relatively exponential upward trend. During the 1980s, the average yearly delta was between 0 and 0.5 whereas this average has climbed past 1.5 in the most recent years. This increase of 1 degree overall has not been witnessed in the century prior to 1980, thus emphasising the magnitude of the rapidly worsening global warming phenomenon.
deltaNASA points out on their website that
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Your task is to construct a confidence interval for the average
annual delta since 2011, both using a formula and using a bootstrap
simulation with the infer package. Recall that the
dataframe comparison has already grouped temperature
anomalies according to time intervals; we are only interested in what is
happening between 2011-present.
# Formula method of CI calculation
formula_ci <- comparison %>%
filter(interval == "2011-present") %>%
drop_na() %>%
summarise(mean_delta = mean(delta),
sd_delta = sd(delta),
count = n(),
t_critical = qt(0.975, count-1),
se_delta = sd_delta / sqrt(count),
margin_of_error = t_critical * se_delta,
delta_low = mean_delta - margin_of_error,
delta_high = mean_delta + margin_of_error
)
# Print out formula_CI
formula_ci## # A tibble: 1 × 8
## mean_delta sd_delta count t_critical se_delta margin_of_error delta_…¹ delta…²
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.07 0.266 139 1.98 0.0226 0.0446 1.02 1.11
## # … with abbreviated variable names ¹delta_low, ²delta_high
library(infer)
# Bootstrap method of CI calculation using Infer package
infer_bootstrap_ci <- comparison %>%
filter(interval == "2011-present") %>%
specify(response = delta) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean", na.rm = TRUE) %>%
get_confidence_interval(level = .95)
# Print out bootstrap CI using Infer package
infer_bootstrap_ci## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1.02 1.11
What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!
Answer:
The results give an interesting insight into the sea and land temperature differences during the 2011 - present time interval. The mean of the temperature delta is around 1.06 degrees. By means of a confidence interval calculation we can state with 95% certainty that the mean temperature delta is between 1.02 and 1.11 degrees for this time period. As there is no underlying distribution, we used the t-statistics and a degrees of freedom of n-1 to calculate the critical values required for a confidence interval calculation. We confirmed this estimate by using a bootstrapping method that manually created the confidence intervals and provided a lower bound of 1.02 degrees and a upper bound of 1.11 degrees.
As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval
# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv')
glimpse(approval_polllist)## Rows: 4,495
## Columns: 22
## $ president <chr> "Joe Biden", "Joe Biden", "Joe Biden", "Joe Biden"…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate <chr> "9/9/2022", "9/9/2022", "9/9/2022", "9/9/2022", "9…
## $ startdate <chr> "1/19/2021", "1/19/2021", "1/20/2021", "1/20/2021"…
## $ enddate <chr> "1/21/2021", "1/21/2021", "1/21/2021", "1/21/2021"…
## $ pollster <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade <chr> "B", "B", "B", "B+", "B", "B-", "B+", "B-", "B", "…
## $ samplesize <dbl> 1500, 15000, 1993, 1516, 15000, 1115, 941, 1200, 1…
## $ population <chr> "lv", "a", "rv", "a", "a", "a", "rv", "rv", "a", "…
## $ weight <dbl> 0.3382, 0.2594, 0.0930, 1.2454, 0.2333, 1.1014, 1.…
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve <dbl> 48.0, 50.0, 56.0, 45.0, 51.0, 55.5, 63.0, 58.0, 52…
## $ disapprove <dbl> 45.0, 28.0, 31.0, 28.0, 28.0, 31.6, 37.0, 32.0, 29…
## $ adjusted_approve <dbl> 49.2, 49.4, 55.4, 46.0, 50.4, 54.6, 59.5, 57.5, 51…
## $ adjusted_disapprove <dbl> 40.3, 30.9, 33.9, 29.0, 30.9, 32.5, 38.4, 32.8, 31…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking <lgl> TRUE, TRUE, NA, NA, TRUE, NA, NA, NA, TRUE, TRUE, …
## $ url <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id <dbl> 74247, 74272, 74246, 74327, 74273, 74248, 74256, 7…
## $ question_id <dbl> 139395, 139491, 139394, 139570, 139492, 139404, 13…
## $ createddate <chr> "1/22/2021", "1/28/2021", "1/22/2021", "2/2/2021",…
## $ timestamp <chr> "09:48:31 9 Sep 2022", "09:48:31 9 Sep 2022", "0…
# Use `lubridate` to fix dates, as they are given as characters.What I would like you to do is to calculate the average net approval
rate (approve- disapprove) for each week since he got into office. I
want you plot the net approval for each week in 2022, along with its 95%
confidence interval. There are various dates given for each poll, please
use enddate, i.e., the date the poll ended. Your plot
should look something like this:
# Create summary statistics of polls grouped by subgroups and across weeks in 2022
poll_summary <- approval_polllist %>%
mutate(enddate = mdy(enddate)) %>%
mutate(net_approval = adjusted_approve - adjusted_disapprove) %>%
mutate(year = lubridate::year(enddate), #Split date format into years and week
week = lubridate::week(enddate)) %>%
filter(year == 2022) %>%
select(c(subgroup, adjusted_approve, adjusted_disapprove, net_approval, year, week)) %>% #Select variables of interest
group_by(week, subgroup) %>%
summarise(mean_approval = mean(net_approval), #Create summary statisticcs and Confidence Intervals for each week and subgroup combination
sd_approval = sd(net_approval),
count = n(),
t_critical = qt(0.975, count-1),
se_approval = sd_approval / sqrt(count),
margin_of_error = t_critical * se_approval,
approval_low = mean_approval - margin_of_error,
approval_high = mean_approval + margin_of_error)
# Plot overall confidence interval for net approval across weeks in 2022
# Facet across subgroup type
ggplot(poll_summary, aes(week, group = subgroup, colour = subgroup)) +
geom_line(aes(y = mean_approval), show.legend = FALSE) +
geom_line(aes(y = approval_low), show.legend = FALSE) +
geom_line(aes(y = approval_high), show.legend = FALSE) +
geom_ribbon(aes(ymin = approval_low, ymax = approval_high, group = subgroup, fill = "red", alpha = 0.3), show.legend = FALSE) + #Create ribbon between upper and lower bounds
facet_grid(rows = vars(subgroup)) +
labs(title = "Biden's Net Approval Rating in 2022", subtitle = "Weekly Data, Approve - Disapprove, %", x = "Week in 2022", y = "%") +
theme_light() +
theme(plot.title = element_text(size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
strip.text.x = element_text(size = 10))NOTE:
Among all voters, regardless of subgroups, Biden’s net approval ratings have consistently been negative and falling. His net approval spiked past week 29 of 2022 possibly due to a sentiment vote for Joe Biden testing positive for coronavirus on the 21st of July 2022. On a more serious note, more probable reasons could be the passing of the Inflation Reduction Act, and the approval of a search warrant of former president Donald Trump’s house in Florida for classified documents; these both happened at the beginning of August 2022.
Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2022-09-06T12%3A41%3A48/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20220911%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20220911T070936Z&X-Amz-Expires=300&X-Amz-Signature=d65c619d35d3fdeb465019e5ee6dc7ef4529195266da6d1099db04b6220152b7&X-Amz-SignedHeaders=host]
## Date: 2022-09-11 07:09
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 180 kB
## <ON DISK> C:\Users\ASUS\AppData\Local\Temp\RtmpaQ3mtm\file105d81edb7879.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and weeks
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))We can easily create a facet grid that plots bikes hired by month and year since 2015
However, the challenge I want you to work on is to reproduce the following two graphs.
The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).
For both of these graphs, you have to calculate the expected number
of rentals per week or month between 2016-2019 and then, see how each
week/month of 2020-2022 compares to the expected rentals. Think of the
calculation
excess_rentals = actual_rentals - expected_rentals.
Should you use the mean or the median to calculate your expected rentals? Why?
Answer:
We should use the mean to calculate expected rentals. In our case, we wanted to smooth (by averaging) all of the data, including anomalies, across the years for rentals within each month/week; this gives a clearer picture of an “average” for each time period since the median statistic would automatically exclude the effects of outliers.
In creating your plots, you may find these links useful:
# Create expected monthly bike rental by averaging monthly data from 2016-2019
expected_monthly_rental_summary <- bike %>%
filter(year >= 2016 & year <= 2019) %>%
group_by(month) %>%
summarise(expected_rental_mean = mean(bikes_hired, na.rm = TRUE))
# Create actual monthly bike rental data from 2017-2022
monthly_summary_17_22 <- bike %>%
filter(year >= 2017) %>%
group_by(year, month) %>%
summarise(monthly_mean = mean(bikes_hired))
# Join both tables
monthly_full_summary <- monthly_summary_17_22 %>%
left_join(y = expected_monthly_rental_summary, by = "month")
# Plot monthly changes in bike rentals faceted on year
ggplot(monthly_full_summary) +
geom_line(aes(x = month, y = monthly_mean, group = year), colour = "black", show.legend = FALSE) + #Line created for actual bike rental
geom_line(aes(x = month, y = expected_rental_mean, group = year), colour = "blue", size = 1.2, show.legend = FALSE) + #Line created for expected bike rental
facet_wrap(~ year) +
geom_ribbon(aes(x = month, ymin = monthly_mean, ymax = pmax(monthly_mean, expected_rental_mean), group = year, fill = "red"), alpha = 0.3, show.legend = FALSE) + #Ribbon created for expected>actual
geom_ribbon(aes(x = month, ymin = expected_rental_mean, ymax = pmax(monthly_mean, expected_rental_mean), group = year, fill = "green"), alpha = 0.3, show.legend = FALSE) + #Ribbon created using actual>expected
scale_fill_manual(values=c("green", "red"), name="fill") + #change colour fill of ribbons
labs(title = "Monthly Changes in TfL bike rentals", subtitle = "Change from monthly average calculated between 2016 annd 2019", x = "Month", y = "Bike Rentals") +
theme_light() +
theme(plot.title = element_text(size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 90),
strip.text.x = element_text(size = 10)) # Create expected weekly bike rental by averaging weekly data from 2016-2019
expected_weekly_rental_summary <- bike %>%
filter(year >= 2016 & year <= 2019) %>%
group_by(week) %>%
summarise(expected_rentals = mean(bikes_hired, na.rm = TRUE))
# Create actual weekly bike rental data from 2017-2022
weekly_summary_17_22 <- bike %>%
filter(year >= 2017) %>%
group_by(year, week) %>%
summarise(weekly_mean = mean(bikes_hired))
# Join both tables and create new column measuring weekly percent deviation
weekly_full_summary <- weekly_summary_17_22 %>%
left_join(y = expected_weekly_rental_summary, by = "week") %>%
mutate(percent_change_weekly = (weekly_mean/expected_rentals - 1) * 100) %>%
filter(!(week >= 52 & year == 2022)) %>% #filter date anomalies (due to the way R defines week 52 and 53)
mutate(up = ifelse(percent_change_weekly > 0, 1, "")) %>%
mutate(down = ifelse(percent_change_weekly < 0, 1, "")) #Create up and down columns for rugs later
# Plot weekly percent deviation in bike rentals faceted on year
ggplot(weekly_full_summary, aes(x = week)) +
facet_wrap(~ year) +
geom_rect(aes(xmin = 14, xmax = 26, ymin = -50, ymax = 100), fill = "grey", alpha = 0.1) + #Create rectangle representing Q2
geom_rect(aes(xmin = 40, xmax = 52, ymin = -50, ymax = 100), fill = "grey", alpha = 0.1) + #Create rectangle representing Q4
geom_line(aes(y = percent_change_weekly), show.legend = FALSE) +
geom_ribbon(aes(ymin = 0, ymax = pmin(0,percent_change_weekly)), alpha = 0.3, fill = "red", show.legend = FALSE) +
geom_ribbon(aes(ymin = percent_change_weekly, ymax = pmin(0,percent_change_weekly)), fill = "green", alpha = 0.3, show.legend = FALSE) + #Create ribbons
geom_rug(data = subset(weekly_full_summary, up == 1), color = "green", sides = "b", show.legend = FALSE) + #Create green rugs
geom_rug(data = subset(weekly_full_summary, down == 1), color = "red", sides = "b", show.legend = FALSE) + #Create red rugs
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
labs(title = "Weekly Changes in TfL bike rentals", subtitle = "% change from weekly average calculated between 2016 annd 2019", x = "Week", y = "Bike Rentals") +
theme_light() +
theme(plot.title = element_text(size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
strip.text.x = element_text(size = 10))NOTE:
We see that during the beginning of coronavirus lockdown periods, we observed a notable reduction in bike rentals. Conversely, past the lifting of general coronavirus measures in mid 2021, we have seen consistent positive excess rentals compared to the mean 2016-2019 bike rentals; this suggests that the local population was particularly inclined to get out of the house and into the fresh air by cycling more regularly. A similar pattern can be observed for the weekly analysis of bike rentals; bike rentals by more than 50%, as compared to the period from 2016-2019, in most weeks after the lockdown measures were lifted.
As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack– and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Yes, we understand the code!
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.